[Feature][history server] support endpoint /events #4479

Open
seanlaii wants to merge 5 commits into ray-project:master from seanlaii:event-api

Conversation

seanlaii (Contributor) commented Feb 4, 2026

Why are these changes needed?

The Ray Dashboard provides an /events endpoint that returns cluster-level events for observability. When a Ray cluster is deleted, this data is lost. This PR implements the /events API for the History Server, enabling users to query historical RayEvents from deleted clusters with a response format compatible with the Ray Dashboard.

Background: Two Event Systems in Ray

Ray has two distinct event systems:

  1. Dashboard Cluster Events (event.proto) - Human-readable messages stored in logs/events/
  2. RayEvents / Export Events (events_base_event.proto) - Structured task/actor lifecycle data exported via the Export API

The History Server Collector stores RayEvents (Export Events), which contain rich task/actor data. This implementation transforms RayEvents into a Dashboard-compatible response format.

Summary

This PR adds:

  • Event types (Event, EventMap, ClusterEventMap) in types/event.go following the existing Task/Actor pattern
  • RayEvent transformation (transformToEvent, extractHostnameAndPid, extractJobIDFromEvent) in eventserver.go
  • /events handler in router.go supporting both live cluster proxy and historical data retrieval
  • Unit tests for Event types with comprehensive coverage
  • E2E tests for both live and dead cluster scenarios

Design Decisions

  • Max 10,000 events per job: Matches Ray Dashboard's MAX_EVENTS_TO_CACHE constant
  • FIFO eviction policy: Drop oldest events when limit exceeded, keeping newest data
  • Group by jobId or global: Node/lifecycle events without jobId are stored under global key
  • camelCase JSON field names: Match Ray Dashboard format for frontend compatibility
  • Use types.EventType constants: Type-safe event type references instead of raw strings
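The grouping and eviction rules above can be sketched as follows. This is a minimal illustration, not the PR's code: `EventStore`, `NewEventStore`, and the trimmed `Event` struct are hypothetical names, and the sketch ignores locking, which a store shared across request handlers would need.

```go
package main

import "fmt"

// Event is a trimmed-down, hypothetical stand-in for the PR's event type.
type Event struct {
	EventID string
	JobID   string // empty for node/lifecycle events that carry no job ID
}

// globalKey is the bucket name the PR description uses for job-less events.
const globalKey = "global"

// EventStore groups events by job ID and caps each bucket (hypothetical helper).
type EventStore struct {
	maxPerJob int
	byJob     map[string][]Event
}

func NewEventStore(maxPerJob int) *EventStore {
	return &EventStore{maxPerJob: maxPerJob, byJob: make(map[string][]Event)}
}

// Add appends e to its bucket and drops the oldest entries once the cap is exceeded.
func (s *EventStore) Add(e Event) {
	key := e.JobID
	if key == "" {
		key = globalKey
	}
	bucket := append(s.byJob[key], e)
	if over := len(bucket) - s.maxPerJob; over > 0 {
		// Copy the surviving tail so evicted events can be garbage collected.
		bucket = append([]Event(nil), bucket[over:]...)
	}
	s.byJob[key] = bucket
}

func main() {
	store := NewEventStore(3) // the PR caps at 10,000; 3 keeps the demo short
	for i := 1; i <= 5; i++ {
		store.Add(Event{EventID: fmt.Sprintf("e%d", i), JobID: "01000000"})
	}
	store.Add(Event{EventID: "node-1"}) // no job ID, so it lands under "global"

	fmt.Println(len(store.byJob["01000000"]), store.byJob["01000000"][0].EventID) // 3 e3
	fmt.Println(len(store.byJob[globalKey]))                                      // 1
}
```

Dropping from the front of an append-ordered slice keeps the newest events, which is the behavior the FIFO bullet describes.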

API Response Format

{
  "result": true,
  "msg": "All events fetched.",
  "data": {
    "events": {
      "<jobId>": [
        {
          "eventId": "...",
          "eventType": "TASK_DEFINITION_EVENT",
          "sourceType": "GCS",
          "timestamp": "2026-01-16T19:16:15.210327633Z",
          "severity": "INFO",
          "label": "TASK_DEFINITION_EVENT",
          "nodeId": "...",
          "sourceHostname": "ray-head-0",
          "sourcePid": 12345,
          "customFields": { ... }
        }
      ],
      "global": [ ... ]
    }
  }
}

Fields with Partial Availability

Unlike Dashboard Cluster Events, which always populate sourceHostname and sourcePid, RayEvents contain this information only in specific nested event types:

  • sourceHostname: Extracted from NodeDefinitionEvent.hostname, available in NODE_DEFINITION_EVENT only
  • sourcePid: Extracted from workerPid, pid, or driverPid, available in TASK_LIFECYCLE_EVENT, ACTOR_LIFECYCLE_EVENT, DRIVER_JOB_DEFINITION_EVENT
  • message: Usually empty in RayEvents (data is stored in nested event fields instead)
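A rough sketch of how such an extraction might look is shown below. It is not the PR's extractHostnameAndPid: the function name `extractHostAndPid`, the `Extracted` type, and the nested JSON key spellings (for example `nodeDefinitionEvent`) are assumptions, as is the idea that PIDs arrive as JSON numbers.

```go
package main

import "fmt"

// Extracted holds the two partially available fields (hypothetical type).
type Extracted struct {
	Hostname string
	Pid      int
	HasPid   bool
}

// nestedKeys maps an event type to the JSON key of its nested payload.
// The key spellings are assumptions based on the naming seen in this thread.
var nestedKeys = map[string]string{
	"NODE_DEFINITION_EVENT":       "nodeDefinitionEvent",
	"TASK_LIFECYCLE_EVENT":        "taskLifecycleEvent",
	"ACTOR_LIFECYCLE_EVENT":       "actorLifecycleEvent",
	"DRIVER_JOB_DEFINITION_EVENT": "driverJobDefinitionEvent",
}

// extractHostAndPid returns whatever hostname/PID information the event carries.
// Event types without a known nested payload yield the zero value.
func extractHostAndPid(eventType string, data map[string]any) Extracted {
	var out Extracted
	key, ok := nestedKeys[eventType]
	if !ok {
		return out
	}
	nested, ok := data[key].(map[string]any)
	if !ok {
		return out
	}
	if eventType == "NODE_DEFINITION_EVENT" {
		out.Hostname, _ = nested["hostname"].(string)
		return out
	}
	for _, field := range []string{"workerPid", "pid", "driverPid"} {
		// encoding/json decodes JSON numbers into float64 when the target is any.
		if v, ok := nested[field].(float64); ok {
			out.Pid, out.HasPid = int(v), true
			break
		}
	}
	return out
}

func main() {
	node := map[string]any{"nodeDefinitionEvent": map[string]any{"hostname": "ray-head-0"}}
	task := map[string]any{"taskLifecycleEvent": map[string]any{"workerPid": float64(12345)}}

	fmt.Printf("%+v\n", extractHostAndPid("NODE_DEFINITION_EVENT", node))
	fmt.Printf("%+v\n", extractHostAndPid("TASK_LIFECYCLE_EVENT", task))
	fmt.Printf("%+v\n", extractHostAndPid("TASK_DEFINITION_EVENT", map[string]any{}))
}
```

Carrying an explicit `HasPid` flag lets callers tell "no PID available" apart from a PID of zero.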

NOTE: This is a fundamental difference between the two event systems in Ray. Dashboard Cluster Events are designed for human-readable logging, while RayEvents are structured for programmatic analysis.
Known Limitation: Event deduplication is not implemented. Duplicate events may occur if the same files are re-read (e.g., hourly refresh). This will be addressed in a follow-up issue along with the existing file tracking TODO.
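One possible shape for the follow-up is a seen-set keyed by event ID, sketched below. This is only an illustration of the idea, not a planned design: `Deduper` and `FirstTime` are hypothetical names, and the sketch assumes eventId is unique and stable across re-reads of the same file.

```go
package main

import "fmt"

// Deduper remembers which event IDs have already been stored (hypothetical helper).
type Deduper struct {
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// FirstTime reports whether eventID has not been seen before, and records it.
func (d *Deduper) FirstTime(eventID string) bool {
	if _, dup := d.seen[eventID]; dup {
		return false
	}
	d.seen[eventID] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	// Simulate reading the same file twice, as a periodic refresh might.
	reads := []string{"e1", "e2", "e1", "e2", "e3"}
	kept := 0
	for _, id := range reads {
		if d.FirstTime(id) {
			kept++
		}
	}
	fmt.Println(kept) // 3
}
```

A real implementation would also need to forget IDs when their events are evicted, otherwise the set grows without bound.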

Related issue number

Closes #4380

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: seanlaii <qazwsx0939059006@gmail.com>
@Future-Outlier Future-Outlier self-assigned this Feb 4, 2026
@seanlaii seanlaii marked this pull request as ready for review February 4, 2026 19:39
Copilot AI mentioned this pull request Feb 5, 2026
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


// Extract sourceHostname and sourcePid from nested events where available
extractHostnameAndPid(event, eventType, data)
}

Task profile event data may be silently lost

Medium Severity

The field taskProfileEvents is uniquely plural among all event data fields (others are singular like taskDefinitionEvent, actorLifecycleEvent). This naming suggests it may contain an array of profile events rather than a single map. The code only handles map[string]any type assertions - if the actual structure is []any, the type assertion silently fails, causing task profile data to not be captured in customFields and jobId extraction to fail (defaulting to "global" grouping).

Contributor Author

The concern is not valid. The taskProfileEvents field is a single object (map), not an array. Here's the evidence:

  1. Proto Definition:
message TaskProfileEvents {
  bytes task_id = 1;
  int32 attempt_number = 2;
  bytes job_id = 3;
  ray.rpc.ProfileEvents profile_events = 4;  // the "events" array is INSIDE here
}
  2. Actual JSON Data:
{
  "eventType": "TASK_PROFILE_EVENT",
  "taskProfileEvents": {
    "attemptNumber": 0,
    "jobId": "BAAAAA==",
    "taskId": "...",
    "profileEvents": {
      "componentId": "...",
      "componentType": "worker",
      "events": [...]  // <-- the array is HERE, inside profileEvents
    }
  }
}
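The point can be checked with a small Go program that decodes a payload of this shape and tries both type assertions. This is a sketch under assumptions: the elided values are replaced with placeholders so the JSON parses, `inspect` and `Inspection` are hypothetical names, and rendering the base64 job ID as hex is for illustration only and not necessarily what the PR does.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// sample follows the JSON above; elided values are replaced with placeholders
// (a dummy taskId, an empty events array) so that it parses.
const sample = `{
  "eventType": "TASK_PROFILE_EVENT",
  "taskProfileEvents": {
    "attemptNumber": 0,
    "jobId": "BAAAAA==",
    "taskId": "placeholder",
    "profileEvents": {"componentType": "worker", "events": []}
  }
}`

// Inspection records which type assertion matched (hypothetical type).
type Inspection struct {
	IsMap    bool
	IsSlice  bool
	JobIDHex string
}

// inspect decodes raw and checks how taskProfileEvents is represented.
func inspect(raw string) (Inspection, error) {
	var data map[string]any
	if err := json.Unmarshal([]byte(raw), &data); err != nil {
		return Inspection{}, err
	}
	var res Inspection
	_, res.IsSlice = data["taskProfileEvents"].([]any)
	nested, ok := data["taskProfileEvents"].(map[string]any)
	res.IsMap = ok
	if !ok {
		return res, nil
	}
	// Proto bytes fields are base64 strings in JSON; decode and show as hex.
	if b64, ok := nested["jobId"].(string); ok {
		rawID, err := base64.StdEncoding.DecodeString(b64)
		if err != nil {
			return res, err
		}
		res.JobIDHex = hex.EncodeToString(rawID)
	}
	return res, nil
}

func main() {
	res, err := inspect(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", res) // {IsMap:true IsSlice:false JobIDHex:04000000}
}
```

With an object-valued taskProfileEvents, the `map[string]any` assertion succeeds and jobId remains reachable; the `[]any` case the bot describes would only arise if the field itself were an array.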

Contributor Author

seanlaii commented Feb 7, 2026

Hi @CheyuWu @troychiu @justinyeh1995, please help review the PR when you have a chance. Thank you!

var response map[string]any

if jobID != "" {
	events := s.eventHandler.ClusterEventMap.GetByJobID(clusterSessionKey, jobID)

justinyeh1995 (Contributor) commented Feb 7, 2026

Just want to double-check the behavior here: /events?job_id= would also return all events. Is that intended?

Contributor Author

No, it should only return the events related to the specified job.
